Language-specific Taggers and Parsers in the OPUS corpus
In the OPUS corpus, language-specific tools for tagging and parsing have been collected and are available for download on the DownloadTools page. To keep the tagging and parsing procedure consistent, the same tools are used for most languages: the Hunpos tagger (Péter Halácsy, András Kornai, Csaba Oravecz, 2007, Hunpos - an open source trigram tagger) and Maltparser (Joakim Nivre and Johan Hall, 2005, Maltparser: A language-independent system for data-driven dependency parsing). For some languages, alternative taggers and/or parsers are used.
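A tag-then-parse pipeline of this kind usually needs a small glue step between the two tools. The following is a minimal sketch in Python, not the OPUS implementation. It assumes that the tagger writes one token and tag per line, separated by a tab, with a blank line between sentences, and that the parser reads 10-column CoNLL-X input. The function name tagged_to_conllx is hypothetical.

```python
from typing import Iterable, Iterator


def tagged_to_conllx(lines: Iterable[str]) -> Iterator[str]:
    """Convert 'token<TAB>tag' lines into 10-column CoNLL-X lines.

    Assumption: sentences in the input are separated by blank lines.
    Columns that the tagger cannot fill are written as '_'.
    """
    index = 0  # token position within the current sentence
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if index:
                yield ""
            index = 0
            continue
        token, _, tag = line.partition("\t")
        tag = tag.strip() or "_"
        index += 1
        # ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        yield "\t".join(
            [str(index), token, "_", tag, tag, "_", "_", "_", "_", "_"]
        )
    if index:
        # Close the last sentence when the input lacks a final blank line.
        yield ""


if __name__ == "__main__":
    sample = ["The\tDT\n", "dog\tNN\n", "barks\tVBZ\n", "\n"]
    print("\n".join(tagged_to_conllx(sample)))
```

The same tag is written to both the coarse and the fine part-of-speech column, which is a simplification; a treebank-specific conversion may fill them differently.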
Czech
The tagger used for tagging Czech texts is the Hunpos tagger, trained on the Prague Dependency Treebank (PDT).
The parser used for parsing Czech texts is Maltparser, trained on the Prague Dependency Treebank (PDT). The Czech parsing model was provided by Marco Kuhlmann, Uppsala University (Marco Kuhlmann and Joakim Nivre, 2010, Transition-Based Techniques for Non-Projective Dependency Parsing).
Chinese
For Chinese, the Zpar parser is used for segmentation, tagging and parsing. The Chinese model was downloaded from http://sourceforge.net/projects/zpar/files/0.4/.
Danish
The tagger used for tagging Danish texts is the Hunpos tagger, trained on the Danish Dependency Treebank (http://www.id.cbs.dk/~mtk/treebank/).
The parser used for parsing Danish texts is Maltparser, trained on the Danish Dependency Treebank (DDT). Optimized settings were provided by Joakim Nivre, Uppsala University.
Dutch
The parser used for parsing Dutch texts is Maltparser, trained on the CDB corpus, i.e. the newspaper part of the Alpino Treebank (van Noord, 2006). The Dutch parsing model was provided by Barbara Plank, University of Groningen.
English
The tagger used for tagging English texts is the Hunpos tagger, trained on the Wall Street Journal section of the Penn Treebank. The English tagging model was downloaded from http://code.google.com/p/hunpos/downloads/list.
The parser used for parsing English texts is Maltparser, trained on the Wall Street Journal section of the Penn Treebank extended with about 4000 questions from the Question Bank, converted to dependency trees using the Stanford Parser. The English parsing model was downloaded from http://maltparser.org/mco/english_parser/engmalt.html.
French
The tagger used for tagging French texts is the MElt tagger (Denis and Sagot, 2009, Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort). The French tagging model was downloaded from https://gforge.inria.fr/frs/download.php/27240/melt-0.6.tar.gz.
The parser used for parsing French texts is Maltparser, trained on a dependency version of the French Treebank. The French parsing model was downloaded from http://maltparser.org/mco/french_parser/fremalt.html.
German
The parser used for parsing German texts is Maltparser, trained on the Tiger Treebank. The German parsing model was provided by Marco Kuhlmann, Uppsala University (Marco Kuhlmann and Joakim Nivre, 2010, Transition-Based Techniques for Non-Projective Dependency Parsing).
Hungarian
The tagger used for tagging Hungarian texts is the Hunpos tagger. The Hungarian tagging model was downloaded from http://code.google.com/p/hunpos/downloads/list.
Italian
Pre-processing tools and taggers for Italian are bundled in TextPro. The parser is trained with Maltparser.
Portuguese
The tagger used for tagging Portuguese texts is the Hunpos tagger, trained on the Floresta corpus.
The parser used for parsing Portuguese texts is Maltparser, trained on the Floresta corpus.
Russian
The tagger used for tagging Russian texts is the Hunpos tagger.
The parser used for parsing Russian texts is Maltparser.
Slovene
The tagger used for tagging Slovene texts is the Hunpos tagger, trained on the jos100k corpus, version 2.0. Training data was provided by Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute.
The parser used for parsing Slovene texts is Maltparser, trained on the jos100k corpus, version 2.0. Training data was provided by Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute.
Spanish
The tagger used for tagging Spanish texts is the SVMTool (Jesús Giménez and Lluís Màrquez, 2004, SVMTool: A general POS tagger generator based on Support Vector Machines), trained on the Ancora corpus. The Spanish tagging model was provided by Jesús Giménez, Universitat Politècnica de Catalunya Barcelona Tech.
The parser used for parsing Spanish texts is Maltparser, trained on the Ancora corpus. The Spanish parsing model was provided by Jesús Giménez.
Swedish
The tagger used for tagging Swedish texts is the Hunpos tagger, trained on the SUC corpus, version 2.0. The Swedish tagging model was provided by Uppsala University.
The parser used for parsing Swedish texts is Maltparser, trained on the Talbanken section of the Swedish Treebank. The Swedish parsing model was downloaded from http://maltparser.org/mco/swedish_parser/swemalt.html.
Turkish
For the morpho-syntactic annotation of Turkish texts, a morphological segmenter and analyser developed by Kemal Oflazer is used (Kemal Oflazer, 1994, Two-level description of Turkish morphology). Its output leaves some tokens ambiguous; these are resolved with the disambiguator described by Yuret and Türe (Deniz Yuret and Ferhan Türe, 2006, Learning morphological disambiguation rules for Turkish).
The parser used for parsing Turkish texts is Maltparser, with a pre-trained Turkish model provided by Gülşen Eryiğit, İstanbul Teknik Üniversitesi (G. Eryiğit, J. Nivre and K. Oflazer, 2006, The incremental use of morphological information and lexicalization in data-driven dependency parsing).
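Summary of the tool choices on this page
Taken together, the per-language choices listed above can be written down as a small lookup table. The following Python sketch is only an illustration of that summary. The names Tools, TOOLS and non_default are hypothetical, and None marks a tool that this page does not state for the language.

```python
from typing import Dict, List, NamedTuple, Optional


class Tools(NamedTuple):
    tagger: Optional[str]  # None: not stated on this page
    parser: Optional[str]  # None: not stated on this page


# Tool names are copied from the per-language sections of this page.
TOOLS: Dict[str, Tools] = {
    "Czech": Tools("Hunpos", "Maltparser"),
    "Chinese": Tools("Zpar", "Zpar"),
    "Danish": Tools("Hunpos", "Maltparser"),
    "Dutch": Tools(None, "Maltparser"),
    "English": Tools("Hunpos", "Maltparser"),
    "French": Tools("MElt", "Maltparser"),
    "German": Tools(None, "Maltparser"),
    "Hungarian": Tools("Hunpos", None),
    "Italian": Tools("TextPro", "Maltparser"),
    "Portuguese": Tools("Hunpos", "Maltparser"),
    "Russian": Tools("Hunpos", "Maltparser"),
    "Slovene": Tools("Hunpos", "Maltparser"),
    "Spanish": Tools("SVMTool", "Maltparser"),
    "Swedish": Tools("Hunpos", "Maltparser"),
    "Turkish": Tools("Oflazer analyser + disambiguator", "Maltparser"),
}


def non_default(tools: Dict[str, Tools] = TOOLS) -> List[str]:
    """Return the languages that use a tool other than Hunpos or Maltparser."""
    return sorted(
        lang
        for lang, t in tools.items()
        if t.tagger not in (None, "Hunpos")
        or t.parser not in (None, "Maltparser")
    )


if __name__ == "__main__":
    for lang in non_default():
        print(lang, TOOLS[lang])
```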